1 Background

We hope to explore the relative influence of physical traits, environmental conditions and species identity on the growth rate of trees. We have two measures of growth rate - basal area and biomass. A gradient boosted model seems like a good candidate for this work since they:

1.1 About Gradient Boosted Models

A gradient boosted machine/model is a machine learning model that uses decision trees to fit the data.

A decision tree first starts with all of the observations, then, from the variables provided, it tries to figure out which variable split would result in the “purest” groupings of the data. So, in this case, it would try to place rows with higher growth rates in one node, and those with lower growth rates in another node.

GBMs are an ensemble of decision trees, nut they are fit sequentially. We call GBMs an ensemble of weak learners as each subsequent tree is an attempt to correct the errors of the previous tree. Thus, while one tree, by itself, can not describe the relationships, with the use of all the trees, we can. Below is a figure by Bradly Bohemke that attempts to illustrate how each subsequent tree improves the fit on the data. Boosted regression decision stumps as 0-1024 successive trees are added

1.2 Changes made to raw data

Below is a list of the changes made to the raw data:

  • Environmental variables - these are the principle components (1-5) from a PCA that was fit on all the environmental variables. We visualized the PCA and used the eginvectors to help figure which environmental condition best explained that PC. There were 5 - Soil Fertility, Light, Temperature, pH, Soil.Humidity.Depth, and Slope.

2 Data Preparation

First, we can import data and select the labels needed for plotting later.

# prior error suggests that there are single quotation marks used to parse numbers
# I specify that we should read those as a thousand separator
# we will no longer used the data with the influence of sampling date removed
rgr_msh_julian <- read_csv(here("data/RGR_MSH_PCA_NA.csv"),
                        guess_max = 10000,
                        col_types = cols())
# these are the labels for the variables - needed for plotting
labels_rgr_msh <- read_csv(here("data/labels.csv"),
                        guess_max = 10000,
                        col_types = cols())
# this helps in listing the variables we actually need to keep, depending on the analyses
PlantTraitsKeep <- labels_rgr_msh %>% filter(Class == 1) %>% select(Feature) %>% reduce(c)
EnvironmentalVariablesKeep <- c( "Soil.Fertility", "Light", "Temperature",  
                                    "pH", "Soil.Humidity.Depth ", "Slope")
OthersKeep <- labels_rgr_msh %>% filter(Class%in% c(3,4, 5,7) )%>% select(Feature) %>% reduce(c) 
OthersKeep <- c("SampleID", "Site", "Species", "julian.date.2011", 
                OthersKeep[!OthersKeep%in% c("SampleID", "Site", "Species", "julian.date.2011")])

Now, we can create the training and test set - 70:30 is default, usually.

# need a training and test set assess the performance of the model
# a 70:30 split is typical
# first, I will work with the imputed data
set.seed(634)
split_train_test <-
  initial_split(
    data = rgr_msh_julian, 
    prop = 0.80) 

rgr_msh_na <-
  split_train_test %>% training() %>% mutate(Group = "Train")%>%
  bind_rows(split_train_test %>% testing() %>% mutate(Group = "Test"))

3 Gradient Boosted Models

3.1 BIO ~ PlantTraits + TreeAge + EnvironmentalVariables

3.1.1 Tune Model

# training the model using H20 - made a function that allows us to do this
# takes ~ 3 minutes
startTime <- Sys.time()
gbmParmBIONoDate <-
  h20_gbm_tune(path = here("data/RGRH20BIO.csv"),
             status = "BIO_GR",
             type = "integer")
endTime <- Sys.time()
endTime-startTime

3.1.2 Model Parameters

First, we look at the best parameters from tuning.

## $model_id
## [1] "final_grid_model_11"
## 
## $training_frame
## [1] "train.hex"
## 
## $validation_frame
## [1] "valid.hex"
## 
## $score_tree_interval
## [1] 10
## 
## $ntrees
## [1] 10000
## 
## $max_depth
## [1] 12
## 
## $min_rows
## [1] 2
## 
## $nbins
## [1] 32
## 
## $nbins_cats
## [1] 32
## 
## $stopping_rounds
## [1] 5
## 
## $stopping_metric
## [1] "MSE"
## 
## $stopping_tolerance
## [1] 1e-04
## 
## $max_runtime_secs
## [1] 3590.617
## 
## $seed
## [1] 1234
## 
## $learn_rate
## [1] 0.05
## 
## $learn_rate_annealing
## [1] 0.99
## 
## $distribution
## [1] "gaussian"
## 
## $sample_rate
## [1] 0.9
## 
## $col_sample_rate
## [1] 0.93
## 
## $col_sample_rate_per_tree
## [1] 0.2
## 
## $min_split_improvement
## [1] 1e-08
## 
## $histogram_type
## [1] "RoundRobin"
## 
## $categorical_encoding
## [1] "Enum"
## 
## $calibration_method
## [1] "PlattScaling"
## 
## $x
##  [1] "Soil.Fertility"     "Light"              "Temperature"        "pH"                 "Slope"             
##  [6] "Estem"              "Branching.Distance" "Stem.Wood.Density"  "Leaf.Area"          "LMA"               
## [11] "LCC"                "LNC"                "LPC"                "d15N"               "t.b2"              
## [16] "Ks"                 "Ktwig"              "Huber.Value"        "X.Lum"              "VD"                
## [21] "X.Sapwood"          "d13C"               "Tree.Age"          
## 
## $y
## [1] "BIO_GR"

3.1.3 Build Model

Now, we can build the model.

set.seed(123)
gbm_regressor_bio_residuals <-
  gbm(BIO_GR ~ .,
      data = 
        rgr_msh_na %>% filter(Group == "Train")%>% filter(!is.na(BIO_GR))%>%
        select(any_of(c(EnvironmentalVariablesKeep, PlantTraitsKeep, "Tree.Age", "BIO_GR"))),
      n.trees = 1000,
      interaction.depth = 12, # max depth
      shrinkage = 0.05, #learning rate
      n.minobsinnode = 22, #col_sample_rate 
      bag.fraction = 0.9, # sample_rate,
      verbose = FALSE,
      n.cores = NULL,
      cv.folds = 5)

3.1.4 Relative Importance

First, we look at the importance of variables in the model.

3.1.5 Partial Dependence

Assessing how, when we hold everything else constant, what the relationships are between growth rate and the predictor.

Soil Fertility

Light

Temperature

pH

Slope

Modulus of Elasticity for Stem

Branching Distance

Stem Wood Density

Leaf Area

Leaf Mass Per Area

Leaf Carbon Concentration

Leaf Nitrogen Concentration

Leaf Phosphorus Concentration

Delta 15N

Thickness to Span Ratio

Conductivity Per Sapwood Area

Conductivity per Branch

Huber Value

Percent Lumen

Vessel Diameter

Percent Sapwood

Delta 13C

Tree Age

3.1.6 Performance - True Traits Value

How does the model perform when we use the true individual trait value?

3.1.7 Performance - Species Means for Trait Value

Does the test error change when we use the average for that species?

3.1.8 Interactions | Table

Let’s explore the interactions in these data.

3.1.9 Interaction | Group | Test

## 
##  Kruskal-Wallis rank sum test
## 
## data:  Value by Class
## Kruskal-Wallis chi-squared = 26.678, df = 3, p-value = 6.879e-06

3.1.10 Interaction | Group | Violin

3.1.11 Interaction | Group | Boxplot

3.1.12 Interaction | Group | Density

3.1.13 Interactions | Plots

Now, we plot interactions with values>0.1.

Leaf Carbon Concentration:Thickness to Span Ratio

Soil Fertility:Leaf Phosphorus Concentration

Stem Wood Density:Conductivity Per Sapwood Area

Slope:Leaf Phosphorus Concentration

Thickness to Span Ratio:Tree Age

Vessel Diameter:Temperature

Branching Distance:Vessel Diameter

Soil Fertility:Leaf Carbon Concentration

Leaf Nitrogen Concentration:Thickness to Span Ratio

Temperature:Branching Distance

Conductivity per Branch:Percent Lumen

Leaf Carbon Concentration:Leaf Nitrogen Concentration

Leaf Phosphorus Concentration:Percent Lumen

Temperature:Huber Value

Soil Fertility:Thickness to Span Ratio

Leaf Carbon Concentration:Tree Age

Leaf Nitrogen Concentration:Leaf Phosphorus Concentration

Modulus of Elasticity for Stem:Leaf Phosphorus Concentration

Soil Fertility:Slope

Temperature:Leaf Phosphorus Concentration

Soil Fertility:Temperature

Thickness to Span Ratio:Temperature

Soil Fertility:Tree Age

Stem Wood Density:Percent Lumen

Vessel Diameter:Conductivity Per Sapwood Area

Leaf Mass Per Area:Modulus of Elasticity for Stem

Leaf Nitrogen Concentration:Delta 13C

Conductivity Per Sapwood Area:Delta 13C

Soil Fertility:Leaf Nitrogen Concentration

Slope:Vessel Diameter

Leaf Phosphorus Concentration:Leaf Carbon Concentration

Light:Conductivity Per Sapwood Area

Percent Lumen:Temperature

Temperature:Leaf Nitrogen Concentration

pH:Leaf Area

Percent Lumen:Delta 13C

Light:Vessel Diameter

Slope:Thickness to Span Ratio

Light:Temperature

Conductivity Per Sapwood Area:Temperature

Tree Age:Percent Lumen

Leaf Mass Per Area:Vessel Diameter

Thickness to Span Ratio:Leaf Phosphorus Concentration

Branching Distance:Percent Sapwood

Leaf Carbon Concentration:Temperature

Huber Value:Percent Sapwood

Percent Lumen:Delta 15N

Temperature:Leaf Mass Per Area

Leaf Phosphorus Concentration:Light

3.1.14 Group Relative Importance

Finally, we compare the relative importance of the various groups - tree age, plant traits, and environmental conditions.

3.2 BAI ~ PlantTraits + TreeAge + EnvironmentalVariables

3.2.1 Tune Model

# training the model using H20
startTime <- Sys.time()
gbmParmBAINoDate <-
  h20_gbm_tune(path = here("data/RGRH20BAI.csv"),
             status = "BAI_GR",
             type = "integer")
endTime <- Sys.time()
endTime-startTime

3.2.2 Model Parameters

First, we look at the best parameters from tuning.

## $model_id
## [1] "final_grid_model_96"
## 
## $training_frame
## [1] "train.hex"
## 
## $validation_frame
## [1] "valid.hex"
## 
## $score_tree_interval
## [1] 10
## 
## $ntrees
## [1] 10000
## 
## $max_depth
## [1] 6
## 
## $min_rows
## [1] 64
## 
## $nbins
## [1] 256
## 
## $nbins_cats
## [1] 512
## 
## $stopping_rounds
## [1] 5
## 
## $stopping_metric
## [1] "MSE"
## 
## $stopping_tolerance
## [1] 1e-04
## 
## $max_runtime_secs
## [1] 3480.334
## 
## $seed
## [1] 1234
## 
## $learn_rate
## [1] 0.05
## 
## $learn_rate_annealing
## [1] 0.99
## 
## $distribution
## [1] "gaussian"
## 
## $sample_rate
## [1] 0.33
## 
## $col_sample_rate
## [1] 0.42
## 
## $col_sample_rate_per_tree
## [1] 0.2
## 
## $min_split_improvement
## [1] 1e-08
## 
## $histogram_type
## [1] "UniformAdaptive"
## 
## $categorical_encoding
## [1] "Enum"
## 
## $calibration_method
## [1] "PlattScaling"
## 
## $x
##  [1] "Soil.Fertility"     "Light"              "Temperature"        "pH"                 "Slope"             
##  [6] "Estem"              "Branching.Distance" "Stem.Wood.Density"  "Leaf.Area"          "LMA"               
## [11] "LCC"                "LNC"                "LPC"                "d15N"               "t.b2"              
## [16] "Ks"                 "Ktwig"              "Huber.Value"        "X.Lum"              "VD"                
## [21] "X.Sapwood"          "d13C"               "Tree.Age"          
## 
## $y
## [1] "BAI_GR"

3.2.3 Build Model

Now, we can build the model.

set.seed(123)
gbm_regressor_bai_residuals <-
  gbm(BAI_GR ~ .,
      data = 
        rgr_msh_na %>% filter(Group == "Train")%>% filter(!is.na(BAI_GR))%>%
        select(any_of(c(EnvironmentalVariablesKeep, PlantTraitsKeep, "Tree.Age", "BAI_GR"))),
      n.trees = 1000,
      interaction.depth = 6, #max depth 
      shrinkage = 0.05, #learning rate
      n.minobsinnode = 10, #col_sample_rate 
      bag.fraction = 0.33, # sample_rate,
      verbose = FALSE,
      n.cores = NULL,
      cv.folds = 5)

3.2.4 Relative Importance

First, we look at the importance of variables in the model.

3.2.5 Partial Dependence

Assessing how, when we hold everything else constant, what the relationships are between growth rate and the predictor.

Soil Fertility

Light

Temperature

pH

Slope

Modulus of Elasticity for Stem

Branching Distance

Stem Wood Density

Leaf Area

Leaf Mass Per Area

Leaf Carbon Concentration

Leaf Nitrogen Concentration

Leaf Phosphorus Concentration

Delta 15N

Thickness to Span Ratio

Conductivity Per Sapwood Area

Conductivity per Branch

Huber Value

Percent Lumen

Vessel Diameter

Percent Sapwood

Delta 13C

Tree Age

3.2.6 Performance - True Traits Value

How does the model perform when we use the true individual trait value?

3.2.7 Performance - Species Means for Trait Value

Does the test error change when we use the average for that species?

3.2.8 Interactions | Table

Let’s explore the interactions in these data.

3.2.9 Interaction | Group | Test

## 
##  Kruskal-Wallis rank sum test
## 
## data:  Value by Class
## Kruskal-Wallis chi-squared = 19.714, df = 2, p-value = 5.239e-05
##                                                                                  Comparison         Z      P.unadj
## 1 Environmental Conditions:Environmental Conditions - Plant Traits:Environmental Conditions -2.744831 6.054205e-03
## 2             Environmental Conditions:Environmental Conditions - Plant Traits:Plant Traits -4.630593 3.646202e-06
## 3                         Plant Traits:Environmental Conditions - Plant Traits:Plant Traits -2.108346 3.500109e-02
##          P.adj
## 1 1.210841e-02
## 2 1.093861e-05
## 3 3.500109e-02

3.2.10 Interaction | Group | Violin

3.2.11 Interaction | Group | Boxplot

3.2.12 Interaction | Group | Density

3.2.13 Interaction | Group | Top

3.2.14 Interactions | Plots

Now, we plot interactions with values>0.10.

Soil Fertility:Conductivity per Branch

Leaf Nitrogen Concentration:Huber Value

Slope:Vessel Diameter

Huber Value:Tree Age

Leaf Phosphorus Concentration:Percent Lumen

Percent Lumen:Percent Sapwood

Leaf Phosphorus Concentration:Tree Age

Conductivity per Branch:Tree Age

Modulus of Elasticity for Stem:Delta 13C

Slope:Tree Age

Delta 15N:Conductivity per Branch

Leaf Nitrogen Concentration:Percent Lumen

Huber Value:Percent Sapwood

Light:Branching Distance

pH:Tree Age

Branching Distance:Percent Lumen

Delta 15N:Thickness to Span Ratio

Slope:Conductivity per Branch

Branching Distance:Leaf Nitrogen Concentration

Leaf Phosphorus Concentration:Thickness to Span Ratio

Leaf Area:Huber Value

Huber Value:Percent Lumen

pH:Conductivity per Branch

Slope:Huber Value

Light:Delta 13C

Temperature:pH

Leaf Carbon Concentration:Huber Value

Slope:Thickness to Span Ratio

Soil Fertility:Light

Light:Conductivity per Branch

Branching Distance:Delta 15N

Temperature:Branching Distance

Temperature:Huber Value

Leaf Phosphorus Concentration:Percent Sapwood

Vessel Diameter:Delta 13C

Modulus of Elasticity for Stem:Leaf Carbon Concentration

Thickness to Span Ratio:Conductivity Per Sapwood Area

Temperature:Leaf Area

Conductivity per Branch:Delta 13C

Leaf Area:Leaf Nitrogen Concentration

Leaf Area:Thickness to Span Ratio

Percent Sapwood:Tree Age

Delta 15N:Huber Value

Leaf Nitrogen Concentration:Percent Sapwood

Slope:Leaf Nitrogen Concentration

Modulus of Elasticity for Stem:Delta 15N

pH:Thickness to Span Ratio

Modulus of Elasticity for Stem:Huber Value

Delta 15N:Delta 13C

Leaf Carbon Concentration:Percent Sapwood

Branching Distance:Leaf Carbon Concentration

Thickness to Span Ratio:Percent Sapwood

Leaf Nitrogen Concentration:Thickness to Span Ratio

Conductivity Per Sapwood Area:Conductivity per Branch

Thickness to Span Ratio:Conductivity per Branch

Light:Leaf Mass Per Area

Thickness to Span Ratio:Huber Value

3.2.15 Group Relative Importance

Finally, we compare the relative importance of the various groups - tree age, plant traits, and environmental conditions.

3.3 BIO ~ SpeciesIdentity + TreeAge + EnvironmentalVariables

3.3.1 Tune Model

startTime <- Sys.time()
gbmParmBIONoDateSpecies <-
  h20_gbm_tune(path = here("data/RGRH20BIOSpecies.csv"),
             status = "BIO_GR",
             type = "integer")
endTime <- Sys.time()
endTime-startTime

3.3.2 Model Parameters

First, we look at the best parameters from tuning.

## $model_id
## [1] "final_grid_model_97"
## 
## $training_frame
## [1] "train.hex"
## 
## $validation_frame
## [1] "valid.hex"
## 
## $score_tree_interval
## [1] 10
## 
## $ntrees
## [1] 10000
## 
## $max_depth
## [1] 15
## 
## $min_rows
## [1] 2
## 
## $nbins
## [1] 1024
## 
## $nbins_cats
## [1] 4096
## 
## $stopping_rounds
## [1] 5
## 
## $stopping_metric
## [1] "MSE"
## 
## $stopping_tolerance
## [1] 1e-04
## 
## $max_runtime_secs
## [1] 3563.938
## 
## $seed
## [1] 1234
## 
## $learn_rate
## [1] 0.05
## 
## $learn_rate_annealing
## [1] 0.99
## 
## $distribution
## [1] "gaussian"
## 
## $sample_rate
## [1] 0.81
## 
## $col_sample_rate
## [1] 0.91
## 
## $col_sample_rate_per_tree
## [1] 0.22
## 
## $min_split_improvement
## [1] 1e-06
## 
## $histogram_type
## [1] "QuantilesGlobal"
## 
## $categorical_encoding
## [1] "Enum"
## 
## $calibration_method
## [1] "PlattScaling"
## 
## $x
## [1] "Soil.Fertility" "Light"          "Temperature"    "pH"             "Slope"          "Species"       
## [7] "Tree.Age"      
## 
## $y
## [1] "BIO_GR"

3.3.3 Build Model

Now, we can build the model.

set.seed(123)
gbm_regressor_bio_residuals_species <-
  gbm(BIO_GR ~ .,
      data = 
        rgr_msh_na %>% filter(Group == "Train")%>% filter(!is.na(BIO_GR))%>%
        select(any_of(c(EnvironmentalVariablesKeep, "Species", "Tree.Age", "BIO_GR"))) %>%
        mutate(Species = factor(Species)),
      n.trees = 1000,
      interaction.depth = 15, # max depth
      shrinkage = 0.05, #learning rate
      n.minobsinnode = 7, #col_sample_rate 
      bag.fraction = 0.81, # sample_rate,
      verbose = FALSE,
      n.cores = NULL,
      cv.folds = 5)

3.3.4 Relative Importance

First, we look at the importance of variables in the model.

3.3.5 Partial Dependence

Assessing how, when we hold everything else constant, what the relationships are between growth rate and the predictor.

Soil Fertility

Light

Temperature

pH

Slope

Species

Tree Age

3.3.6 Performance

How does the model perform when we use the true individual trait value?

3.3.7 Interactions | Table

Let’s explore the interactions in these data.

3.3.8 Interaction | Group | Test

## 
##  Kruskal-Wallis rank sum test
## 
## data:  Value by Class
## Kruskal-Wallis chi-squared = 0.52641, df = 2, p-value = 0.7686

3.3.9 Interaction | Group | Violin

3.3.10 Interaction | Group | Boxplot

3.3.11 Interaction | Group | Density

3.3.12 Interactions | Plots

Now, we plot interactions with values>0.10.

Species:Tree Age

Tree Age:Soil Fertility

Slope:Species

pH:Slope

Temperature:pH

Light:Slope

Temperature:Slope

Soil Fertility:Species

Light:Species

pH:Species

3.3.13 Group Relative Importance

Finally, we compare the relative importance of the various groups - tree age, plant traits, and environmental conditions.

3.4 BAI ~ SpeciesIdentity + TreeAge + EnvironmentalVariables

3.4.1 Tune Model

startTime <- Sys.time()
gbmParmBAINoDateSpecies <-
  h20_gbm_tune(path = here("data/RGRH20BAISpecies.csv"),
             status = "BAI_GR",
             type = "integer")
endTime <- Sys.time()
endTime-startTime

3.4.2 Model Parameters

First, we look at the best parameters from tuning.

## $model_id
## [1] "final_grid_model_56"
## 
## $training_frame
## [1] "train.hex"
## 
## $validation_frame
## [1] "valid.hex"
## 
## $score_tree_interval
## [1] 10
## 
## $ntrees
## [1] 10000
## 
## $max_depth
## [1] 3
## 
## $min_rows
## [1] 1
## 
## $nbins
## [1] 512
## 
## $nbins_cats
## [1] 512
## 
## $stopping_rounds
## [1] 5
## 
## $stopping_metric
## [1] "MSE"
## 
## $stopping_tolerance
## [1] 1e-04
## 
## $max_runtime_secs
## [1] 3585.667
## 
## $seed
## [1] 1234
## 
## $learn_rate
## [1] 0.05
## 
## $learn_rate_annealing
## [1] 0.99
## 
## $distribution
## [1] "gaussian"
## 
## $sample_rate
## [1] 0.58
## 
## $col_sample_rate
## [1] 0.74
## 
## $col_sample_rate_per_tree
## [1] 0.76
## 
## $min_split_improvement
## [1] 0
## 
## $histogram_type
## [1] "UniformAdaptive"
## 
## $categorical_encoding
## [1] "Enum"
## 
## $calibration_method
## [1] "PlattScaling"
## 
## $x
## [1] "Soil.Fertility" "Light"          "Temperature"    "pH"             "Slope"          "Species"       
## [7] "Tree.Age"      
## 
## $y
## [1] "BAI_GR"

3.4.3 Build Model

Now, we can build the model.

set.seed(123)
gbm_regressor_bai_residuals_species <-
   gbm(BAI_GR ~ .,
      data = 
        rgr_msh_na %>% filter(Group == "Train")%>% filter(!is.na(BAI_GR))%>%
        select(any_of(c(EnvironmentalVariablesKeep, "Species", "Tree.Age", "BAI_GR"))) %>%
        mutate(Species = factor(Species)),
      n.trees = 1000,
      interaction.depth = 3, # max depth
      shrinkage = 0.05, #learning rate
      n.minobsinnode = 5, #col_sample_rate 
      bag.fraction = 0.58, # sample_rate,
      verbose = FALSE,
      n.cores = NULL,
      cv.folds = 5)

3.4.4 Relative Importance

First, we look at the importance of variables in the model.

3.4.5 Partial Dependence

Assessing how, when we hold everything else constant, what the relationships are between growth rate and the predictor.

Soil Fertility

Light

Temperature

pH

Slope

Species

Tree Age

3.4.6 Performance

How does the model perform?

3.4.7 Interactions | Table

Let’s explore the interactions in these data.

3.4.8 Interaction | Group | Test

## 
##  Kruskal-Wallis rank sum test
## 
## data:  Value by Class
## Kruskal-Wallis chi-squared = 0.38788, df = 1, p-value = 0.5334

3.4.9 Interaction | Group | Violin

3.4.10 Interaction | Group | Boxplot

3.4.11 Interaction | Group | Density

3.4.12 Interactions | Plots

Now, we plot interactions with values>0.10.

Species:Tree Age

Temperature:Species

Species:Slope

Light:Species

pH:Species

Soil Fertility:Species

Light:Tree Age

pH:Slope

Soil Fertility:Light

3.4.13 Group Relative Importance

Finally, we compare the relative importance of the various groups - tree age, plant traits, and environmental conditions.

3.5 BIO ~ SpeciesIdentity + PlantTraits + TreeAge + EnvironmentalVariables

3.5.1 Tune Model

# training the model using H20 - made a function that allows us to do this
# takes ~ 5 minutes
startTime <- Sys.time()
gbmParmBIONoDateSpeciesAgeEP <-
  h20_gbm_tune(path = here("data/RGRH20BIOSpeciesAgeEP.csv"),
             status = "BIO_GR",
             type = "integer")
endTime <- Sys.time()
endTime-startTime

3.5.2 Model Parameters

First, we look at the best parameters from tuning.

## $model_id
## [1] "final_grid_model_95"
## 
## $training_frame
## [1] "train.hex"
## 
## $validation_frame
## [1] "valid.hex"
## 
## $score_tree_interval
## [1] 10
## 
## $ntrees
## [1] 10000
## 
## $max_depth
## [1] 11
## 
## $min_rows
## [1] 2
## 
## $nbins
## [1] 512
## 
## $nbins_cats
## [1] 4096
## 
## $stopping_rounds
## [1] 5
## 
## $stopping_metric
## [1] "MSE"
## 
## $stopping_tolerance
## [1] 1e-04
## 
## $max_runtime_secs
## [1] 3544.164
## 
## $seed
## [1] 1234
## 
## $learn_rate
## [1] 0.05
## 
## $learn_rate_annealing
## [1] 0.99
## 
## $distribution
## [1] "gaussian"
## 
## $sample_rate
## [1] 0.36
## 
## $col_sample_rate
## [1] 0.42
## 
## $col_sample_rate_per_tree
## [1] 0.67
## 
## $min_split_improvement
## [1] 1e-04
## 
## $histogram_type
## [1] "QuantilesGlobal"
## 
## $categorical_encoding
## [1] "Enum"
## 
## $calibration_method
## [1] "PlattScaling"
## 
## $x
##  [1] "Soil.Fertility"     "Light"              "Temperature"        "pH"                 "Slope"             
##  [6] "Estem"              "Branching.Distance" "Stem.Wood.Density"  "Leaf.Area"          "LMA"               
## [11] "LCC"                "LNC"                "LPC"                "d15N"               "t.b2"              
## [16] "Ks"                 "Ktwig"              "Huber.Value"        "X.Lum"              "VD"                
## [21] "X.Sapwood"          "d13C"               "Species"            "Tree.Age"          
## 
## $y
## [1] "BIO_GR"

3.5.3 Build Model

Now, we can build the model.

set.seed(123)
gbm_regressor_bioSpeciesAgeEP <-
  gbm(BIO_GR ~ .,
      data = 
        rgr_msh_na %>% filter(Group == "Train")%>% filter(!is.na(BIO_GR))%>%
        select(any_of(c(EnvironmentalVariablesKeep, PlantTraitsKeep, "Species",
                        "Tree.Age", "BIO_GR"))) %>%
        mutate(Species = factor(Species)),
      n.trees = 1000,
      interaction.depth = 11, # max depth
      shrinkage = 0.05, #learning rate
      n.minobsinnode = 10, #col_sample_rate 
      bag.fraction = 0.36, # sample_rate,
      verbose = FALSE,
      n.cores = NULL,
      cv.folds = 5)

3.5.4 Relative Importance

First, we look at the importance of variables in the model.

3.5.5 Partial Dependence

Assessing how, when we hold everything else constant, what the relationships are between growth rate and the predictor.

Soil Fertility

Light

Temperature

pH

Slope

Modulus of Elasticity for Stem

Branching Distance

Stem Wood Density

Leaf Area

Leaf Mass Per Area

Leaf Carbon Concentration

Leaf Nitrogen Concentration

Leaf Phosphorus Concentration

Delta 15N

Thickness to Span Ratio

Conductivity Per Sapwood Area

Conductivity per Branch

Huber Value

Percent Lumen

Vessel Diameter

Percent Sapwood

Delta 13C

Species

Tree Age

3.5.6 Performance

How does the model perform?

3.5.7 Interactions | Table

Let’s explore the interactions in these data.

3.5.8 Interaction | Group | Test

## 
##  Kruskal-Wallis rank sum test
## 
## data:  Value by Class
## Kruskal-Wallis chi-squared = 13.939, df = 3, p-value = 0.002989

3.5.9 Interaction | Group | Violin

3.5.10 Interaction | Group | Boxplot

3.5.11 Interaction | Group | Density

3.5.12 Interactions | Plots

Now, we plot interactions with values>0.1.

Thickness to Span Ratio:Species

Light:Conductivity Per Sapwood Area

Leaf Phosphorus Concentration:Species

Species:Soil Fertility

Leaf Carbon Concentration:Species

Leaf Carbon Concentration:Thickness to Span Ratio

Soil Fertility:Leaf Phosphorus Concentration

Species:Tree Age

Leaf Nitrogen Concentration:Species

Light:Vessel Diameter

3.5.13 Group Relative Importance

Finally, we compare the relative importance of the various groups - tree age, plant traits, and environmental conditions.

3.6 BAI ~ SpeciesIdentity + PlantTraits + TreeAge + EnvironmentalVariables

3.6.1 Tune Model

# training the model using H20
startTime <- Sys.time()
gbmParmBAINoDateSpeciesAgeEP <-
  h20_gbm_tune(path = here("data/RGRH20BAISpeciesAgeEP.csv"),
             status = "BAI_GR",
             type = "integer")
endTime <- Sys.time()
endTime-startTime

3.6.2 Model Parameters

First, we look at the best parameters from tuning.

## $model_id
## [1] "final_grid_model_92"
## 
## $training_frame
## [1] "train.hex"
## 
## $validation_frame
## [1] "valid.hex"
## 
## $score_tree_interval
## [1] 10
## 
## $ntrees
## [1] 10000
## 
## $max_depth
## [1] 12
## 
## $min_rows
## [1] 4
## 
## $nbins
## [1] 128
## 
## $nbins_cats
## [1] 2048
## 
## $stopping_rounds
## [1] 5
## 
## $stopping_metric
## [1] "MSE"
## 
## $stopping_tolerance
## [1] 1e-04
## 
## $max_runtime_secs
## [1] 3526.383
## 
## $seed
## [1] 1234
## 
## $learn_rate
## [1] 0.05
## 
## $learn_rate_annealing
## [1] 0.99
## 
## $distribution
## [1] "gaussian"
## 
## $sample_rate
## [1] 0.54
## 
## $col_sample_rate
## [1] 0.39
## 
## $col_sample_rate_per_tree
## [1] 0.76
## 
## $min_split_improvement
## [1] 0
## 
## $histogram_type
## [1] "UniformAdaptive"
## 
## $categorical_encoding
## [1] "Enum"
## 
## $calibration_method
## [1] "PlattScaling"
## 
## $x
##  [1] "Soil.Fertility"     "Light"              "Temperature"        "pH"                 "Slope"             
##  [6] "Estem"              "Branching.Distance" "Stem.Wood.Density"  "Leaf.Area"          "LMA"               
## [11] "LCC"                "LNC"                "LPC"                "d15N"               "t.b2"              
## [16] "Ks"                 "Ktwig"              "Huber.Value"        "X.Lum"              "VD"                
## [21] "X.Sapwood"          "d13C"               "Species"            "Tree.Age"          
## 
## $y
## [1] "BAI_GR"

3.6.3 Build Model

Now, we can build the model.

set.seed(123)
gbm_regressor_baiSpeciesAgeEP <-
  gbm(BAI_GR ~ .,
      data = 
        rgr_msh_na %>% filter(Group == "Train")%>% filter(!is.na(BAI_GR))%>%
        select(any_of(c(EnvironmentalVariablesKeep, PlantTraitsKeep,"Species" ,
                        "Tree.Age", "BAI_GR")))%>%
        mutate(Species = factor(Species)),
      n.trees = 1000,
      interaction.depth = 12, #max depth 
      shrinkage = 0.05, #learning rate
      n.minobsinnode = 10, #col_sample_rate 
      bag.fraction =  0.54, # sample_rate,
      verbose = FALSE,
      n.cores = NULL,
      cv.folds = 5)

3.6.4 Relative Importance

First, we look at the importance of variables in the model.

3.6.5 Partial Dependence

Assessing how, when we hold everything else constant, what the relationships are between growth rate and the predictor.

Soil Fertility

Light

Temperature

pH

Slope

Modulus of Elasticity for Stem

Branching Distance

Stem Wood Density

Leaf Area

Leaf Mass Per Area

Leaf Carbon Concentration

Leaf Nitrogen Concentration

Leaf Phosphorus Concentration

Delta 15N

Thickness to Span Ratio

Conductivity Per Sapwood Area

Conductivity per Branch

Huber Value

Percent Lumen

Vessel Diameter

Percent Sapwood

Delta 13C

Species

Tree Age

3.6.6 Performance

How does the model perform?

3.6.7 Interactions | Table

Let’s explore the interactions in these data.

3.6.8 Interaction | Group | Test

## 
##  Kruskal-Wallis rank sum test
## 
## data:  Value by Class
## Kruskal-Wallis chi-squared = 14.487, df = 3, p-value = 0.002312

3.6.9 Interaction | Group | Violin

3.6.10 Interaction | Group | Boxplot

3.6.11 Interaction | Group | Density

3.6.12 Interactions | Plots

Now, we plot interactions with values>0.10

Species:Leaf Mass Per Area

Delta 15N:Species

Branching Distance:Leaf Phosphorus Concentration

Delta 13C:Species

Delta 15N:Delta 13C

Species:Tree Age

Slope:Species

Branching Distance:Species

Light:Species

Soil Fertility:Branching Distance

pH:Species

Percent Lumen:Species

Vessel Diameter:Species

Leaf Phosphorus Concentration:Percent Lumen

Light:Branching Distance

3.6.13 Group Relative Importance

Finally, we compare the relative importance of the various groups - tree age, plant traits, and environmental conditions.